Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration for caching embeddings #27

Merged
merged 26 commits into from
May 30, 2024

Conversation

GabrieleGhisleni
Copy link
Contributor

@GabrieleGhisleni GabrieleGhisleni commented May 13, 2024

Hi all, following our recent discussion in Issue #19334, I have opened this pull request to specifically adds the caching backend for embeddings.

@GabrieleGhisleni GabrieleGhisleni changed the title add cache embeddings cache embeddings May 13, 2024
@GabrieleGhisleni GabrieleGhisleni changed the title cache embeddings integration for caching embeddings May 13, 2024
@maxjakob
Copy link
Collaborator

@GabrieleGhisleni to fix the Python 3.8 issues we need to from typing import Dict, List and use these two classes instead of dict and list in type annotations.

@maxjakob
Copy link
Collaborator

@GabrieleGhisleni Thank you for contributing! Let me know when this is ready for review, happy to take a look.

@GabrieleGhisleni
Copy link
Contributor Author

@maxjakob Thank you for your willingness to review! The changes are not quite ready yet, but I'll let you know as soon as they are. I appreciate your patience!

@GabrieleGhisleni GabrieleGhisleni marked this pull request as ready for review May 22, 2024 14:27
@GabrieleGhisleni
Copy link
Contributor Author

@maxjakob The pull request should be ready to be reviewed.

@maxjakob maxjakob self-requested a review May 23, 2024 10:43
Copy link
Collaborator

@maxjakob maxjakob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job so far! I'm mainly wondering if there is a hard requirement to store the vectors as lists of floats or if we can store them as bytes and integrate a little nicer with how other caches work?

libs/elasticsearch/README.md Outdated Show resolved Hide resolved
libs/elasticsearch/README.md Outdated Show resolved Hide resolved

Caching embeddings is obtained by using the [CacheBackedEmbeddings](https://python.langchain.com/docs/modules/data_connection/text_embedding/caching_embeddings),
in a slightly different way than the official documentation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add something on how (and maybe why) it is different?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"In a different way" because the documentation always mentions only the from_bytes_store method, whereas we show that it can be instantiated directly as an argument of CacheBackedEmbeddings.

libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
libs/elasticsearch/pyproject.toml Outdated Show resolved Hide resolved
Comment on lines +402 to +405
if self._metadata is not None:
body["metadata"] = self._metadata
if self._store_input:
body["text_input"] = text_input
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume you want to store these so you can inspect the cache and learn something about usage statistics? Wondering if this is the right way to do this (or rather collect stats differently), but I'm not opposed to it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core idea behind storing both the text and embeddings together is to ensure that we have the possibility to reconstruct the model or utilize them in other ways in the future. It is just to preserve valuable information for potential future use cases.

libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
Gabriele Ghisleni added 2 commits May 24, 2024 09:42
@maxjakob
Copy link
Collaborator

@GabrieleGhisleni no rush on this at all from my side, but feel free to tag me once the merge conflicts are resolved and this is ready for review again :)

Gabriele Ghisleni added 2 commits May 30, 2024 08:26
@GabrieleGhisleni
Copy link
Contributor Author

@maxjakob great! the conflicts should be fixed

Copy link
Collaborator

@maxjakob maxjakob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvements! I left some more comments for cosmetic changes.

libs/elasticsearch/README.md Outdated Show resolved Hide resolved
libs/elasticsearch/tests/conftest.py Outdated Show resolved Hide resolved
libs/elasticsearch/tests/unit_tests/test_imports.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/__init__.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/_utilities.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/_utilities.py Outdated Show resolved Hide resolved
libs/elasticsearch/langchain_elasticsearch/cache.py Outdated Show resolved Hide resolved
@GabrieleGhisleni
Copy link
Contributor Author

@maxjakob I should have fix the last suggestions. Let me know if it seems ok to you.

Copy link
Collaborator

@maxjakob maxjakob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome work @GabrieleGhisleni! Thank you for your contribution.

@maxjakob maxjakob merged commit aa9c150 into langchain-ai:main May 30, 2024
11 checks passed
@maxjakob
Copy link
Collaborator

I will release this tomorrow.

baskaryan pushed a commit to langchain-ai/langchain that referenced this pull request Jun 6, 2024
…22612)

The package for LangChain integrations with Elasticsearch
https://github.com/langchain-ai/langchain-elastic contains a
Elasticsearch byte store cache integration (see
langchain-ai/langchain-elastic#27). This is the
documentation contribution on the page dedicated to stores integrations

Co-authored-by: Gabriele Ghisleni <[email protected]>
hinthornw pushed a commit to langchain-ai/langchain that referenced this pull request Jun 20, 2024
…22612)

The package for LangChain integrations with Elasticsearch
https://github.com/langchain-ai/langchain-elastic contains a
Elasticsearch byte store cache integration (see
langchain-ai/langchain-elastic#27). This is the
documentation contribution on the page dedicated to stores integrations

Co-authored-by: Gabriele Ghisleni <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants